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r- H Abstract. The Primal-Dual hybrid gradient (PDHG) method is a powerful optimiza- 

C ~ ) tion scheme that breaks complex problems into simple sub-steps. Unfortunately, PDHG 

O^l methods require the user to choose stepsize parameters, and the speed of convergence is 

highly sensitive to this choice. We introduce new adaptive PDHG schemes that automat- 
ically tune the stepsize parameters for fast convergence without user inputs. We prove 
rigorous convergence results for our methods, and identify the conditions required for 
convergence. We also develop practical implementations of adaptive schemes that for- 
mally satisfy the convergence requirements. Numerical experiments show that adaptive 
PDHG methods have advantages over non-adaptive implementations in terms of both 
efficiency and simplicity for the user. 
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1. Introduction 
This manuscript considers saddle-point problems of form 
(!) minmax/(ic) + (Ax,y) - g(y) 

x£X y£Y 

where / and g are convex functions, A £ R MxJV jg a matrix, and X C M. N and Y C K M are 
convex sets. 

The formulation (1) is particularly appropriate for solving inverse problems involving the 
ti norm. Such problems take the form 

(2) mm\Sx\ + ^\\Bx-f\\ 2 

where | • | denotes the t\ norm, B and S are linear operators, and || • || is the £2 norm. 
The formulation (2) is useful because it naturally enforces sparsity of Sx. Many different 
problems can be addressed by choosing different values for the sparsity transform S. 

In the context of image processing, S is frequently the gradient operator. In this case 
the t\ term becomes |Vu|, the total variation semi-norm [1]. Problems of this form arise 
whenever an image is to be recovered from incomplete or noise-contaminated data. In this 
case, B is often a Gaussian blur matrix or a sub-sampled fast transform, such as a Fourier 
or Hadamard transform. 

As we will see below, the problem (2) can be put in the "saddle-point" form (1) and can 
then be addressed using the techniques described below. Many other common minimiza- 
tion problems can also be put in this common form, including image segmentation, TVL1 
minimization, and general linear programing. 

In many problems of practical interest, / and g do not share common properties, making 
it difficult to derive numerical schemes for (1) that address both terms simultaneously. 
However, it frequently occurs in practice that efficient algorithms exist for minimizing / 
and g independently. In this case, the Primal-Dual Hybrid Gradient (PDHG) [2, 3] method 
is quite useful. This method removes the coupling between / and g, enabling each term to 
be addressed separately. Because it decouples / and g, the steps of PDHG can often be 
written explicitly, as opposed to other splitting methods that require expensive minimization 
sub-problems. 

One of the primary difficulties with PDHG is that it relies on step-size parameters that 
must be carefully chosen by the user. The speed of the method depends heavily on the 
choice of these parameters, and there is often no intuitive way to choose them. 

We will present practical adaptive schemes that optimize the PDHG parameters automat- 
ically as the problem is solved. Out new methods are not only much easier to use in practice, 
but also result in much faster convergence than constant-stepsize schemes. After introduc- 
ing the adaptive methods, we prove new theoretical results that guarantee convergence of 
PDHG under very general circumstances, including adaptive stepsizes. 

1.1. Notation. Given two vectors u,v € R N , we will denote their discrete inner product 
by (u, v) — J2i u i v i- I n some situations, we will also use the "dot product" notation, u ■ v = 
(u,v). 

The formulation (1) can be generalized to handle complex-valued vectors. In this case, 
we will consider the real part of the inner product. We use the notion 3?{-} to denote the 
real part of a vector or scalar. 

We will make use of a variety of norms, including the ^2-norm, ||w|| = ^/Si u f j an d the 
4-norm, |u| = J2i \ u i\- 
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When M is a symmetric positive definite matrix, we define the M-norm by 

\\uf M = (Mu,u), 

If M is indefinite, this does not define a proper norm. In this case, we will still write 
to denote the quantity (Mm, it), even though this is an abuse of notation. 
For a matrix M, we can also define the "operator norm" 

ii. ,11 W Mu W 
M op = max . 

u \\u\\ 

If M is symmetric, then the operator norm is the spectral radius of M, which we denote by 
p(M). 

We use the notation df to denote the sub-differential (i.e., generalized derivative) of a 
function /. 

Finally, we will use xc to denote the characteristic function of a convex set C, which is 
defined as follows: 

'o, ifxeC 

oo, otherwise. 



Xc{x) 



2. The Primal-Dual Hybrid Gradient Method 

The PDHG scheme has its roots in the well-known Arrow-Hurwicz scheme, which was 
originally proposed in the field of economics and later refined for solving saddle point prob- 
lems by Popov [4]. While the simple structure of the Arrow-Hurwicz method made it 
appealing, tight stepsize restrictions and poor performance make this method impractical 
for many problems. 

Research in this direction was reinvigorated by the introduction of PDHG, which con- 
verges rapidly for a wide range of stepsizes. PDHG originally appeared in a technical report 
by Zhu and Chan [5]. It was later analyzed for convergence in [2, 3], and studied in the 
context of image segmentation in [6] . An extensive technical study of the method and its 
variants is given by He and Yuan [7]. 

The PDHG method for solving (1) only requires us to evaluate the ■proximal operators of 
/ and g. This operators are given by 

PDHG is listed in Algorithm 1. The algorithm can be interpreted as a forward-backward 
algorithm in both the primal and dual parameters. In steps (2-3), the method decreases 
the energy (1) in x by first taking a gradient descent step with respect to the inner product 
term in (1), and then taking a "backward" or proximal step for /. In steps (5-6), the energy 
(1) is increased by changing y. A gradient ascent step is taken with respect to the inner 
product term, and then a backward step is taken with respect to g. 



Algorithm 1 Basic PDHG 



Require: x Q e R N , y Q e R M , a k , r k > 
while Not Converged do 
Xk+i =%k — T k A T y k 

x k+1 = argmin^ f(x) + ±-\\x - x k+1 \\ 2 
Sfc+i = x k+ i + (xk+i - x k ) 
Vk+l = Vk + °~kAx k+ i 
y k+1 = argmin yey g(y) + ^\\y - y k+1 \\ 2 
end while 
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Note that steps 3 and 6 of Algorithm 1 require minimizations. These minimization steps 
can be written in a more compact form using the proximal operators of / and g: 

Jtf(x) = argmin/(x) + — ||a: - x\\ 2 
(3) X£X 1 

J<tg(v) = axgming(y) + — \\y - y\\ 2 



Algorithm 1 has been analyzed in the case of constant stepsizes, r k = r and a k = a. 
In particular, it is known to converge as long as err < jrgrj\ [2, 3, 7]. However, PDHG 
typically does not converge when non-constant stepsizes are used, even in the case that 

CfcTfc < p (A T A)- 

In this article, we identify the specific stepsize conditions that guarantee convergence and 
propose practical adaptive methods that enforce these conditions. 

Remark. Step 4 of Algorithm 1 is a prediction step of the form Xk+i = ^k+i +#(^fc+i — x k ), 
where 9 = 1. Note that PDHG methods have been analyzed for 9 E [—1, 1]. Some authors 
have even suggested non-constant values of 9 as a means of accelerating the method [3]. 
However, it is known that the case 9 = 1 tolerates the largest stepsizes (see [7]), and as a 
result 9 = 1 has been used almost exclusively in applications. 

2.1. Optimality Conditions and Residuals. Problem (1) can be written in its uncon- 
strained form using the characteristic functions for the sets X and Y. We have 

(4) min max f(x) + Xx(x) + (Ax,y) - g(y) - xv{y)- 

Let F = d{f + Xx(x)}, and G = d{g + Xr(y)} 1 - From the equivalence between (1) and (4), 
we see that the optimality conditions for (1) are 

(5) € F(x*) + A T y* 

(6) e G(y*) - A(x*). 

The optimality conditions (5,6) state that the derivative with respect to both the primal 
and dual variables must be zero. This motivates us to define the primal and dual residuals 

P(x,y) = F(x) + A T y 
D(x,y) = G{y)-Ax. 

We can measure convergence of the algorithm by tracking the size of these residuals. Note 
that the residuals are in general multi- valued (because they depend on the sub-differentials of 
/ and g, as well as the characteristic functions of X and Y). We can obtain explicit formulas 
for these residuals by observing the optimality conditions for step 2 and 4 of Algorithm 1: 

(8) OeF(x k+1 ) + A T y k + —{x k+1 -x k ) 

Tk 

(9) e G(y k+1 ) - A(2x k+1 - x k ) + —(y k+1 - y k ). 

Ok 



^Note that d{f + Xx( x )} — 9f + dxx( x ) precisely when the resolvent minimizations (3) are feasible 
(see Rockafellar [8] Theorem 23.8), and in this case the optimality conditions (4) admit a solution. We will 
assume for the remainder of this article that the problem (4) is feasible. 
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Manipulating these optimality conditions yields 

(10) P k +i = — {x k - Xk+i) - A T (y k - yk+i) e F(x k+1 ) + A T y k+1 

(11) D k+ i = — (y k - yk+i) - A{x k - x k +i) e G(y k +i) - Ax k+1 . 

&k 

Formulas (10) and (11) define a sequence of primal and dual residual vectors such that 
P k E P(x k ,y k ) and D k £ D(x k ,yk)- 
We say that Algorithm 1 converges if 

lim ||P fc || 2 + ||P fe || 2 =0. 

k— too 

Note that studying the residual convergence of Algorithm 1 is more general than studying 
the convergence of subsequences of iterates, because the solution to (1) need not be unique. 

3. Common Saddle-Point Problems 

Many common variational problems have saddle-point formulations that are efficiently 
solved using PDHG. While the applications of saddle-point problems are vast, we focus here 
on several simple problems that commonly appear in image processing. 

3.1. Total- Variation Denoising. A ubiquitous problem in image processing is minimizing 
the Rudin-Osher-Fatemi (ROF) denoising model [1]: 

(12) minlVxI + ^la;-/!! 2 . 

The energy (12) contains two terms. The ti term on the right minimizes the squared error 
between the recovered image and the noise-contaminated measurements, /. The TV term, 
|Vx|, enforces that the recovered image be smooth in the sense that its gradient has sparse 
support. 

The problem(12) can be formulated as a saddle-point problem by writing the TV term as 
a maximization over the "dual" variable y € M 2xAr , where the image x € M. N has N pixels: 

(13) TV(x) = \Vx\ = max y ■ Vx. 

IMIoo<l 

Note that the maximization is taken over the set 

C 00 = {yeB 2 * N \yl+yl t <l}. 
The TV-regularized problem (12) can then be written in its primal-dual form 

(14) max min ^lla; — /II 2 + y ■ Vx 

yGCoo x 2 

which is clearly a saddle-point problem in the form (1) with / = — /|| 2 , A = V, g = 0, 
X = R N , and Y = C^. 

To apply Algorithm 1, we need efficient solutions to the sub-problems in steps 3 and 6, 
which can be written 

(15) J TF (x) = argmin £\\x - /|| 2 + ^- \\x - x\\ 2 = T (/i/ + -x) 

x 2 2r TH + 1 t 



M 

(16) JaG(y) = argmin -^-\\y - y\\ 2 = , { , 



1 II.. .-.112 ( Vi 
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3.2. TVL1. Another common denoising model is TVL1 [9]. This corresponds to minimizing 
the energy 

(17) mm|Vx|+/i|a;-/| 

X 

which is similar to (12), except that the data term relies on the i\ norm instead of l 2 . This 
model is very effective for problems involving "heavy-tailed" noise, such as shot noise or 
salt-and-pepper noise. 

To put the energy (17) into the form (1), we write the data term in its variational form 

n\x- f\ = max(y,x-/) 

ye Cm 

where C M = {y e R N | \yt\ < //}. 

The energy (17) can then be minimized using the formulation 

(18) max minyi ■ Vx + (y 2 ,x) - (y 2 , /) 
which is of the form (1). 

3.3. Globally Convex Segmentation. Image segmentation is the task of grouping pixels 
together by intensity and spatial location. One of the simplest models for image segmen- 
tation is the globally convex segmentation model of Chan, Esedoglu, and Nikolova (CEN) 
[10, 11]. Given an image / and two real numbers c\ and c 2 we wish to partition the pixels 
into two groups depending on whether their intensity lies closer to c\ or c 2 . Simultaneously, 
we want the regions to have smooth boundaries. 

The CEN segmentation model has the variational form 

(19) min\Vx\ + (x,l) 

0<x<l 

where U — (fi — C1) 2 — (/, — c 2 ) 2 . The inner-product term of the right forces the entries in x 
toward either 1 or 0, depending on whether the corresponding pixel intensity in / is closer 
to ci or c 2 . At the same time, the TV term enforces that the boundary is smooth. 
Using the identity (13), we can write the model (19) as 

(20) max min y ■ Vx + (x, I). 

yeCooxelo.i] 

We can then apply Algorithm 1, where step 3 takes the form 

Jtf{x) = max{0, min{a;, 1}} 

and step 6 is given by (16). 

Generalizations of (19) to multi-label segmentations and general convex regularizers have 
been presented in [12, 13, 14, 15]. Many of these models result in minimizations similar to 
(19), and PDHG methods have become a popular scheme for solving these models. 

3.4. Compressed Sensing / Single-Pixel Cameras. In compressed sensing, one is in- 
terested in reconstructing images from incomplete measurements taken in an orthogonal 
transform domain (such as the Fourier or Hadamard domain). It has been shown that high- 
resolution images can be obtained from incomplete data as long as the image has a sparse 
representation [16, 17], for example a sparse gradient. 

The Single Pixel Camera (SPC) is an imaging modality that leverages compressed sensing 
to reconstruct images using a small number of measurements [18]. Rather than measuring 
in the pixel domain like conventional cameras, SPC's measure a subset of the Hadamard 
transform coefficients of an image. 
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To reconstruct images, SPC's rely on variational problems of the form 
(21) \Vx\ + ^\\RHx-b\\ 2 

where b contains the measured transform coefficients of the image, and H is an orthogonal 
transform matrix (such as the Hadamard transform). Here, R is a diagonal "row selector" 
matrix, with diagonal elements equal to 1 if the corresponding Hadamard row has been 
measured, and if it has not. 

This problem can be put into the saddle-point form 

max min — \\RHx — b\\' 2 + y ■ Vi 
ySCoo x 2 

and then solved using PDHG. 

The solution to step 6 of Algorithm 1 is given by (16). Step 3 requires we solve 

1 



jma^\\RHx-b\r + —\\x-£ 
x 2 2t 



2 



Because H is orthogonal, we can write the solution to this problem explicitly as 

x = H T (^R + ~l\ H (^iH T Rb + -x 

Note that for compressed sensing problems of the form (21), PDHG has the advantage 
that all steps of the method can be written explicitly, and no expensive minimization sub- 
steps are needed. 

3.5. £oo Minimization. In wireless communications, one is often interested in finding signal 
representations with low peak-to-average power ratio (PAPR) [19, 20]. Given a signal z with 
high dynamic range, we can obtain a more efficient representation by solving 

(22) min ||x||oo subject to \\Dx — z\\ < e 

X 

where x is the new representation, D is the representation basis, and e is the desired level 
of representation accuracy [20]. Minimizations of the form (22) also arise in numerous other 
applications, including robotics [21] and nearest neighbor search [22]. 
The problem (22) can be written in the constrained form 

min ||xi | |oo subject to xi = Dx\ — z 

where C e — {z\ \\z\\ < e}. When we introduce Lagrange multipliers y to enforce the equality 
constraint, we arrive at the saddle-point formulation 

max min \\xi |joo + (y, Dx\ — X2 — z) 

y£R M x 1 £R N ,x 2 £C e 

which is of the form (1). 

It often happens that D is a complex- valued frame (such as a sub-sampled Fourier matrix) . 
In this case, x and y can take on complex values. The appropriate saddle-point problem is 
then 

max min ||xi ||oo + 3t{(y, Dx\ — x-i — z)\ 

yem. M x!&r n ,x 2 ec e 

and we can apply PDHG just as we did for real- valued problems provided we are careful to 
use the Hermitian definition of the matrix transpose (see the remark at the end of Section 
5.2). 
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To apply PDHG to this problem, we need to evaluate the proximal minimization 

min IMIoo + ^ll^-^ll 2 
x Zt 

which can be computed efficiently using a sorting algorithm [23] . 

3.6. Linear Programming. Suppose we wish to solve a large-scale linear program in the 
canonical form 

min (c, x) 

(23) 

subject to Ax < b, x > 

where c and b are vectors, and A is a matrix [24]. This problem can be written in the 
primal-dual form 

maxmin(c, x) + (y, Ax — b) 

y>0 x>0 

where y can be interpreted as Lagrange multipliers enforcing the condition Ax < b [24]. 
Steps 3 and 6 of Algorithm 1 are now simply 

J t f{x) = max{i — re, 0} 

J<?a{y) = max{y - ab, 0}. 

The PDHG solution to a linear program is advantageous for extremely large problems for 
which conventional simplex and interior-point solvers are intractable. Primal-dual solvers 
for (23) have been studied in [25], and compared to conventional solvers. 

4. Residual Balancing 

When choosing the stepsize for PDHG, there is a tradeoff between the primal and dual 
residuals. Choosing a large value of Tfc creates a very powerful minimization step in the 
primal variables and a slow maximization step in the dual variables, resulting in very small 
primal residuals at the cost of large dual residuals. Choosing r k to be small, on the other 
hand, results in small dual residuals at the cost of large primal errors. 

Ideally, one would like to choose stepsizes so that the larger of P k and D k is as small as 
possible. If we assume the primal/dual residuals decrease/increase monotonically with Tfe, 
then max{Pfc, D k } is minimized when both residuals are equal in magnitude. This suggests 
that Tfc be chosen to "balance" the primal and dual residual - i.e., the primal and dual 
residuals should be roughly the same size, up to some scaling to account for units. This 
principle has been suggested for other iterative methods (see [26, 27]). 

The particular residual balancing principle we suggest is to enforce 

(24) \P k \~s\D k \ 

where | • | denotes the £\ norm, and s is a scaling parameter. Residual balancing methods 
work by "tuning" parameters after each iteration to approximately maintain this equality. 
If the primal residual grows disproportionately large, T k is increased and a k is decreased (or 
vice- versa) . 

In (24) we use the £i-norm to measure the size of the residuals. In principle, l\ could 
easily be replaced with the £2 norm, or any other norm. We have found that the l\ norm 
performs somewhat better than the £2 norm for some problems, because it is less sensitive 
to outliers that may dominate the residual. 

Note the proportionality in (24) depends on a constant s. This is to account for the effect 
of scaling the inputs (i.e., changing units). For example, in the TVL1 model (17) the input 
image / could be scaled with pixel intensities in [0, 1] or [0, 255]. As long as the primal step 
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sizes used with the latter scaling are 255 times that of the former (and the dual step sizes 
are 255 times smaller) both problems produce similar sequences of iterates. However, in the 
[0, 255] case the dual residuals are 255 times larger than in the [0, 1] case while the primal 
residuals are the same. 

For image processing problems, we recommend using the [0,255] scaling with s = 1, 
or equivalently the [0, 1] scaling with s = 255. This scaling enforces that the saddle point 
objective (1) is nearly maximized with respect to y and that the saddle point term (13) is 
a good approximation of the total variation semi-norm. 

5. Adaptive Methods 

In this section, we develop new the adaptive PDHG methods. The first method assumes 
we have a bound on p(A T A). In this case, we can enforce the stability condition 

(25) r k a k <L< p{A T A)-\ 

This is the same stability condition that guarantees convergence in the non-adaptive case. In 
the adaptive case, this condition alone is not sufficient for convergence but leads to relatively 
simple methods. 

The second method does not require any knowledge of p(A T A). Rather, we use a back- 
tracking scheme to enforce that the stepsizes are small enough to guarantee convergence. 
This is similar to the Armijo-type backtracking line searches that are used in conventional 
convex optimization. In addition to requiring no knowledge of p(A T A), the backtracking 
scheme has the advantage that it can use relatively long steps that violate the stability 
condition (25). Especially for problems with highly sparse solutions, this can lead to faster 
convergence than methods that enforce (25). 

Both of these methods have strong convergence guarantees. In particular, we show in 
Section 6 that both methods converge in the sense that the norm of the residuals goes to 
zero. 

5.1. Adaptive PDHG. The first adaptive method is listed in Algorithm 2. The loop in 
Algorithm 2 begins by performing a standard PDHG step using the current stepsizes. In 
steps 4 and 5, we compute the primal and dual residuals and store their i\ norms in p k and 
dk- If the primal residual is sufficiently large compared to the dual residual then the primal 
stepsize is increased by a factor of (1 — afc) -1 , and the dual stepsize is decreased by a factor 
of ( 1 — ak ) ■ If the primal residual is somewhat smaller than the dual residual then the primal 
stepsize is decreased and the dual stepsize is increased. If both residuals are comparable in 
size, then the stepsizes remain the same on the next iteration. 

The parameter A > 1 is used to compare the sizes of the primal and dual residuals. The 
stepsizes are only updated if the residuals differ by a factor greater then A. 

The sequence {a k } controls the adaptivity level of the method. We start with some 
a 6 (0, 1). Every time we choose to update the stepsize parameters, we define afc+i = rja^ 
for some rj < 1. In this way, the adaptivity decreases over time. We will show in Section 6 
that this definition of {afc} is needed to guarantee convergence of the method. 

Remark. The adaptive scheme requires several arbitrary constants as inputs. These are 
fairly easy to choose in practice. A fairly robust choice is oo = 0.5, A = 1.5, r\ = 0.95, 
and t = (T = l/\/p(A T A). However, these parameters can certainly be tuned for various 
applications. As stated above, we recommend choosing s — 1 for imaging problems when 
pixels lie in [0, 255], and s — 255 when pixels lie in [0, 1]. 
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Algorithm 2 Adaptive PDHG 

Require: x Q eR N ,y e M M ,cr r < p(A T A)- 1 , 
l: while p k , d k > tolerance do 
Xk+l = Jr k F(xk - T k A T y k ) 
Vk+i = J<j k a{yk + a k A(2x k+ i - x k )) 
Pk+i = |(xfc - x k +i)/T k - A T (y k - Vk+\)\ 
d k+ i = \(y k - y k+ i)/a k - A(x k ~ x k+ \)\ 
if pk+i > sdk+iA then 
Tfc + i = r k /{l - a k ) 
ffe+i = (Jfc(l - 
ak+i = Ci k r\ 
end if 

if pk+i < sdk+i/A then 

Tk + l = Tfe(l - Qffc) 

^+1 = o-fc/d- - «fe) 

afc+i = akV 
end if 

if sd fc+ i/A < pk+i < sd k+ iA then 

Tk + l = T k 

ffe+i = Qfc 
afe+i = a* 
end if 
end while 



€ (0, l) 2 , A > l,s > 

> Begin with normal PDHG 

> Compute primal residual 

> Compute dual residual 

> If primal residual is large... 

> Increase primal stepsize 

> Decrease dual stepsize 

> Decrease adaptivity level 

> If dual residual is large... 

> Decrease primal stepsize 

> Increase dual stepsize 

> Decrease adaptivity level 

> If residuals are similar... 

> Leave primal step the same 

> Leave dual step the same 

> Leave adaptivity level the same 



Remark. The computation of the residuals in steps 4 and 5 of Algorithm 2 requires multipli- 
cations by A T and A, which seems at first to increase the cost of each iteration. Note however 
that the values of A T yk and Ax k are already computed in the other steps of the algorithm. 
If we simply note that A T (y k - Vk+i) = A T y k - A T y k +i and A(x k - x k +i) = Ax k - Ax k +i, 
then we can evaluate the residuals in steps 4 and 5 without any additional multiplications 
by A or A T . 

5.2. Backtracking PDHG. In situations where bounds on p(A T A) are unavailable, or 
when the stability condition (25) is overly conservative, the backtracking method presented 
here becomes valuable. PDHG with backtracking is presented in Algorithm 3. 

The backtracking scheme is similar to the simple adaptive method (Algorithm 2) with 
one key difference. We will see in Section 6 that convergence is guaranteed if the following 
"backtracking" condition holds at each step: 

(OR , , 2T k a k (A(x k+1 ~ x k ),y k +i-yk) 

26 ) bk = 11 U2~, il ii2 - L 

ja k \\xk+i - x k \\ 2 +jT k \\y k+ i - yk\r 

where 7 e (0, 1) is a constant. If this inequality does not hold, then the stepsizes are too 
large. In this case we decrease the stepsizes by a factor of where f3 € (0, 1) is a constant. 
Reasons for this particular update choice are elaborated in the Section 6.4. 

We will show in Section 6 that this backtracking step can only be activated a finite 
number of times. Furthermore, enforcing the condition (26) at each step is sufficient to 
guarantee convergence, provided that one of the spaces X or Y is bounded. 

We recommend choosing 7 = 0.75 and (3 = 0.95, although the method is convergent for 
any 7,0 6(0,1). 



ADAPTIVE PRIMAL-DUAL HYBRID GRADIENT METHODS FOR SADDLE-POINT PROBLEMS 11 



Algorithm 3 Backtracking PDHG 

Require: x € R N , y & K A/ , (r 0) a , s) > 0, A > 1, (77, a , 0, 7) € (0, l) 4 



1 


while Pk,dk > tolerance do 




2 


Xk+l — Jr k F{xk — TkA T l)k) 


> Begin with normal PDHG 


3 


Vk+1 = Jrr, o(Vk + <TkA(2Xk+l — XhY) 




4 


= Ifa^/t — xta-i )/ti. — A T (ijk — )\ 


o Compute primal residual 


5 


<ii.+i = \(vk — yh+-\)/ah — A(xh — xi.+\)\ 


o Compute dual residual 


g 


^ 2T fc cr fc (A(a; fc+ i-a;fe),i/ fe+ i-y fc ) 


n> (inTnnntp stfinilitv rnnnilinn 


/ 


if hi ^ 1 
11 Ofc > 1 


X It ol o l~n m 1~\ t cntirliTinrt f 01 Ic 
11 SldUllliy L.U11111 11U11 lelllb... 


Q 
O 


Tk+1 = f^Tk/bk, (Tk+1 — P<Tk/bk 


l> l^tLlCelbL. bLLpMZLti 




x k+l — x k, Vk+1 — Vk 


fx K ODn *~i 1 *H l T^yQ loc 

ix iveep uiu iLciaieo 


10 


O-k+1 — O>o 


i> Reset adaptivity parameter 


1 1 

1 1 


eise 11 Pk+i > sdk+ii-i 


[> If primal residual is large... 


12 


r n- 111 ^ \ 

Tk+i = Tk/(l - ak) 


D> Increase primal stepsize 


13 


(Tk+i = er*(l — ak) 


D> Decrease dual stepsize 


11 


Otk+l = C£ k 1] 


o Decrease adaptivity level 


15 


else if p k +i < sdk+i/A 


> If dual residual is large... 


16 


Tk+i = Tfc(l - a fe ) 


o Decrease primal stepsize 


17 


Cfc+1 = <7fc/(l - Qffe) 


\> Increase dual stepsize 


18 


a/c+i = afc?7 


o Decrease adaptivity level 


19 


else if sd k +i/A < p k +i < seifc+iA 


> If residuals are similar... 


20 


Tk+1 = Tk 


o Leave primal step the same 


21 


(Tk+1 = (Tk 


\> Leave dual step the same 


22 


ctk+i = ak 


[> Leave adaptivity level the same 


23 


end 




24 


end while 





The initial stepsizes To and Co should be chosen so that Toco is somewhat larger than 
p(A T A) -1 . In the numerical experiments below, we choose 




(27) t = a = 

where x r is a random Gaussian distributed vector. Note that the ratio ^nunr 1 ^ is guaranteed 

to be less than p(A T A), and thus t (To > p(A T A)~ 1 . The factor of 2 is included to account 
for problems for which the method is stable with large steps. 

Remark. Note that our convergence theorems require either X or Y to be bounded. This 
requirement is indeed satisfied by most of the common saddle-point problems described 
in Section 3. However, we have found empirically that the scheme is quite stable even 
for unbounded X and Y. For the linear programming example, neither space is bounded. 
Nonetheless, the backtracking method converges reliably in practice. 

Remark. In the case of complex-valued problems, both x and y may take on complex 
values, and the relevent saddle-point problem is 

min max /(x) + $l{(Ax,y}} - g{y) 



12 



TOM GOLDSTEIN, ERNIE ESSER, RICHARD BARANIUK 



where 3?{-} denotes the real part of a vector. In this case the numerator of b k may have an 
imaginary component. The algorithm is still convergent as long as we replace b k with its 
real part and use the definition 

^ 2g ^ ^ _ %t{2T k a k (A(x k+1 -x k ),y k+1 - y k )} 

ja k \\x k+1 - x k \\ 2 +7T fc ||y fc+ i - y k \\ 2 

in Algorithm 3. 



6. Convergence Theory 

6.1. Variational Inequality Formulation. For notational simplicity, we define the vector 
quantities 



t7.H -A T \ „ fr- l I 



and the matrices 

This notation was first suggested to simplify PDHG by He and Yuan [7]. 

Following [7] , it is possible to formulate each step of PDHG as a solution to the variational 
inequality 

(31) OeR(u k+1 )+M k (u k+1 -u k ). 

Also, the optimality conditions (5) and (6) can be written succinctly as 

(32) G R(u*). 
Note that R is monotone (see [28]), meaning that 

(u — u, R(u) — R(u)) > 0, Vtt, u. 



6.2. Stepsize Conditions. Here, we state the conditions that guarantee convergence of 
adaptive PDHG. We begin by defining the following constants: 

d k = mm { , , 1 



T k <Tk 

(33) 4> k = 1 - S k > 0. 

These constants quantify how much the stepsizes decrease with each iteration. 
Consider the following conditions on the step-sizes in Algorithm 1: 
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Convergence Conditions for Adaptive PDHG 

Algorithm 1 converges if the following three requirements hold: 
A: The sequences {r/c} and {<Jk} are bounded. 
B: The sequence {<pk} is summable, i.e. 

(j> k < oo. 

fe>0 

C: One of the following two conditions is met: 

CI: There is a constant L such that for all k > 

r k a k <L< p(A T A)- 1 . 

C2: Either X or Y is bounded, and there is a 
constant c 6 (0,1) such that for all k > 

IK+l - UfcllM* > C||«fe+1 - u k\\ 2 H k - 

Two conditions must always hold to guarantee convergence: A The stepsizes must remain 
bounded, and Bthe sequence {(pi} must be summable. Together, these conditions ensure 
that the steps sizes do not oscillate too wildly as k gets large. 

In addition, either CI or C2must hold. When we know the spectral radius of A T A, we 
can use condition CI to enforce that the method is stable. When we have no knowledge of 
p(A T A), or when we want to take large stepsizes, the backtracking condition C2can also 
be used to enforce stability. We will see that for small enough stepsizes, this backtracking 
condition can always be satisfied. 

Note that condition CI is slightly stronger than condition C2 . The advantage of condi- 
tion CI is that it results is somewhat simpler methods because backtracking does not need 
to be used. However when backtracking is used to enforce condition C2 , we can take larger 
steps that violate condition CI . 

In the following sub-sections, we explain how Algorithms 2 and 3 explicitly satisfy con- 
ditions A, Band C, and are therefore guaranteed to converge. In Section 6.5, we prove 
convergence of the general Algorithm 1 under assumptions A , B and C . 



6.3. Convergence of the Adaptive Method. Algorithm 2 is a practical implementation 
of adaptive PDHG that satisfies conditions A , B , and CI . An examination of Algorithm 
2 reveals that, depending on the balance between the residuals, there are three possible 
stepsize updates. Regardless of which update occurs, the product Tk<Jk remains unchanged 
at each iteration, and so r k a k = two = L < p(A T A)~ 1 . It follows that condition A and 
CI arc satisfied. 

Also, because < a k < 1, we have 

. r Tk+i &k+i I otk , if the steps are updated 

0fc = l-mm{ , ' 1 l = i 1 

Tfc Cfc II, otherwise. 

Note that every time we update the stepsizes, we update ak+i = fjak where 77 < 1. Thus 
the non-zero entries in the sequence {<pk} form a decreasing geometric sequence. It follows 
that the summation condition B is satisfied. 

Because Algorithm 2 satisfies conditions A , B and C , its convergence is guaranteed by 
the theory presented in Section 6.5. 
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6.4. Convergence of the Backtracking Method. Algorithm 3 uses the same adaptive 
stepsize updates as Algorithm 2. As we shall prove in Section 6.5, the backtracking step is 
triggered only a finite number of times. Thus, the backtracking method satisfies conditions 
A and B . 

We can expand condition C2 using the definition of || • ||jvf fc and || • \\n k to obtain 

— \\x k+1 - x k \\ 2 - 2(A(x k +i - x k ),y k+1 - y k ) + — \\y k+ i - y k \\ 2 
T k a k 

> —\\x k+1 - x k \\ 2 + — \\yk+i - Vk\\ 2 - 
Tk <J k 

If we combine like terms and let 7 = 1 — c, then this condition is equivalent to 

(34) — ||x fc+ i - x k \\ 2 + —\\yk+i - Vk\\ 2 > 2(A(x k+1 - x k ),y k+1 - y k ). 

T k a k 

If we note that the left side of (34) is non-negative, then we can easily sec that this condition 
is equivalent to (26). Thus, the backtracking conditions (26) explicitly enforce condition C2 , 
and the convergence of Algorithm 3 is guaranteed by the analysis below. 

We now address the form of the backtracking update in line 8 of Algorithm 3. A simple 
Armijo-type backtracking scheme would simply choose r k+ i — ^ k r k and o k +i — £<7fe, where 
£ < 0, every time that (34) is violated. However if our initial guess for L is very large, then 
this backtracking scheme could be very slow. Rather, we wish to predict the value of £ that 
makes (34) hold. To do this, we assume that the values of \\x k+ i — x k \\ and \\yk+i — y k \\ 
decrease linearly with £. Under this assumption, the value of £ that makes (34) into an 
equality is ^ = . In order to guarantee that the stepsizes can become arbitrarily small, 
we make the slightly more conservative choice of £ k = f3/b k for /3 < 1. 

6.5. Convergence Proof. In this section, we prove convergence of Algorithm 1 under 
assumptions A , B and C . 

The overall strategy of this proof is to show that the PDHG method satisfies a Fejer 
inequality, i.e., an inequality stating that the iterates move toward the solution set on every 
iteration. We derive a simple Fejer inequality in Lemma 2. This lemma states that u k+ i 
always lies closer to the solution set than u k as measured in the M k norm. Normally, this 
would be enough to prove convergence (see [28], chapter 5). However, in our case the proof 
is not so straight-forward for several reasons. First, the Ai^-norm changes every time we 
update the stepsizes. Second, when the backtracking condition C2is used, M k may be 
indefinite, in which case || • \\M k is not a proper norm, and is can even assume negative 
values. 

Nonetheless, we can still prove convergence. We begin by proving that CI guarantees 
that M k is positive definite. We then derive a Fejer inequality in Lemma 2. In cases when 
M k is indefinite, Lemmas 3 and 4 show that terms involving the Mfe-norm remain uniformly 
bounded over all iterations. These bounding conditions will be strong enough to allow us 
to complete the proof in the case that M k is indefinite. 

This first lemma has been adapted from He and Yuan [7]. It shows that the stability 
condition CI forces the matrix M k to be positive definite. When this condition is satisfied, 
the operator (u, M k u) — ||u||A/ fc is a proper norm and can be used in our convergence 
analysis. 

Lemma 1 also shows that the backtracking step in Algorithm 3 can only be triggered 
a finite number of times. When the product r k a k becomes sufficiently small, so does the 
constant Cm in the lemma, and the ratio (26) will always be less than 1. 



ADAPTIVE PRIMAL-DUAL HYBRID GRADIENT METHODS FOR SADDLE-POINT PROBLEMS 15 



Lemma 1. // to~\\ A T A\\ op < 1 then M is positive definite, and we have 

(1 - Cm) (hy\\ 2 + kx\\A < \\uf M < (1 + C M ) (-\\y\\ 2 + i||x|| 



where Cm = \/ T0 ~\\A T A\\ op . 
Proof. By definition 

(35) (u 1 Mu) = -\\x\\ 2 ~2(Ax,y) + -\\y\\ 2 . 

t a 

Now use the inequality a 2 + b 2 > 2ab to obtain 

2 { A X ,y)<2 \\M^\\v\Ur.\\A-A U i 
K ,y> ~ {ra\\ATA\\ op )\ ^ " " 

(36) < UTA b T .. \\yf + y/™\\ ATA \\°r \\ x f 



V™U T A\\ 



Op 



1„ „o 1 



I^IU ( ~||l/|| 8 +~N| 2 

. 0~ T 



Applying 36 to 35 yeilds 



\u\\ 2 M > (l- ^to\\ATA\\ op ) [\\\ 2 + \\\ x ^ 



We also have that 



and so 



(u,Mu) = -\\x\\ 2 -2(Ax,y) + -\\y\\ 2 
t a 

< l\\ x \\ 2 + 2\\A\\ op \\x\\\\y\\ + -\\y\\ 2 

T G 

< -\\x\\ 2 + -\\y\\ 2 + Jra\\ATA\\ op (-\\y\\ 2 + -||x|| 2 



1„ „o 1 



nf M < 1 + Jto-\\ATA\\ op \ ( -|| 2/ || 2 + -||x|| 2 



a t 



□ 



We next show that the distance between the true solution and the PDHG iterates is 
decreasing in the M k norm. If M k were constant and positive definite, then this result 
would be sufficient for convergence. However, because the matrix M k changes at each 
iteration and may be indefinite, this condition does not guarantee convergence on its own. 



Lemma 2. The iterates generated by Algorithm 1 satisfy 

IK - u *\\ 2 M h > IK+i - u k\\ 2 Mk + IK+i - ""Ik- 
Proof. Subtracting (31) from (32) gives us 

M k (u k+1 - u k ) e R{u*) - R(u k+1 ). 
Taking the inner product with (u* — u k+ i) gives us 

(u* - u k+1 ,M k (u k+1 - u k )) > (u* - u k+1 ,R(u*) - R(u k+1 )). 
Because R is monotone, the right hand side of the above equation is non-negative, and so 
(37) (u* - u k+1 ,M k (u k+ i - u k )) > 0. 
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Now, observe the identity 

IK - u *\\li k = IK+i - Ukf Mk + \\ u k+i - u*fu k 
+ 2(u k - u k+ i,u k+1 - u*) Mk . 

Applying (37) gives us 

(38) |K - "Ilk > IK+i - u*f Mk + |K+i - u k f Mk . 

□ 

We now show that the iterates generated by PDHG remain bounded. 
Lemma 3. Suppose the step sizes for Algorithm 1 satisfy conditions A, Band C. Then 

IK - u *\\ 2 H k < c u 

for some upper bound Cjj > 0. 

Proof. We first consider the case of condition CI . From (36) we have 

2{Ax,y) < ^ra\\ATA\\ op (^\\y\\ 2 + ^\\x\\ 2 ^ . 
Subtracting 2^J to^A t A^ ov {Ax, y) from both sides yields 

\ - 2 v /r < 7||^A|| op ) {Ax, y) < ^raWA\\ op (^-\\y\\ 2 + \\\x\\ 2 - 2(Ax,y) 



\m\J™\\A t A\\ op . 

Taking x = x k +i — x* , and y = y k +i — y*, t — r k+ i, and a = <r k +i we obtain 
2(A(x k+1 ~ x*),y k+1 - y*) < C x \\u k+1 - u*\\\. 



where C x = _Vlh+A°M22h = > 0. 

l-y/T k+1 a k + 1 \\A T A\\ op 

Applying this to the result of Lemma 2, yields 

IK-w*IIm* > \\ u k+i- u^wl^ 

= \\u k+1 - u*\\ 2 Hk + (A(x k+1 - x*),y k+1 - y*) 

> S k \\u k+1 - u*\\ 2 Hk+i + (A(x k+1 -x*),y k+1 — y*) 

> S k \\u k+1 -u*f Mk+1 + (1 - 5 k ){A(x k +i -x*),y k+1 -y*) 

> 4IK+1 - u*f Mk+1 - (1 - 4)Ci||u fe+ i - u*\\ 2 Mk+i 
= (l-(l + C 1 )cf >k )\\u k+1 -u*f Mk+i . 

It follows that 

ri-1 

(39) hi-u*\\ 2 Mi > II (!-(! + W*)IK-«1m„ • 

k = l 

Since J^k^k < oo, we have that J2 k {^ + Ci)4>k < oo, and the product on the right of (39) 
is bounded away from zero. Thus, there is a constant C 2 with 

\\ui-u*\\ 2 Mi \\ >C 2 \\u n -u*\\ 2 Mn >C 2 (l-C M )\\u n -u*\\ 2 Hn 

and the lemma is true in the case of assumption CI . 
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We now consider the case where condition C2 holds. We assume without loss of generality 
that Y is bounded (the case of bounded X follows by nearly identical arguments). In this 
case, we have \\y\\ < C y for all y £ Y. 

Note that 

IK+i -u*\\ Mk = — \\x k+ i -x*\\ 2 - {A{x k+l -x*),y k+x -y*) + —\\yk+i - y*f 

(40) > — \\x k +i - x*\\ 2 - 2C y \\A\\ op \\x k+1 - x*\\ + —\\y k +i - y*\\ 2 - 

T k a k 

When ||xfe+i — x*\\ grows sufficiently large, the term involving the square of this norm 
dominates the value of (40). Since {r k } and {cr k } are bounded from above, it follows that 
there is some positive C x such that 

— \\x k+1 - x*f + —\\y k+ i - y*\\ 2 > C x 

T k CTfe 

—\\xk+i - x*\\ 2 + —\\y k +i - y*\\ 2 > 4:C y \\A\\ op \\x k+1 -x*\\ 
T k a k 

(41) =► 2||« fc+ i - u*\\ Mk > -\\x k+1 - x*\\ 2 + — ||i/ fc+ i - y*\\ 2 . 

T k <T k 

Whenever ^||x fc+ i - x*\\ 2 + ^|K+i - y*\\ 2 > C x we have 

IK+i - u*\\ Mk = — \\x k+1 - x*\\ 2 - (A(x k+1 - x*),y k+1 - y k ) + —\\y k+ i - y*\\ 2 
T k a k 

S S 
> —\\x k+1 - x*\\ 2 ~ (A(x k+1 ~ x*),y k+1 - y*) + — — |K+i - y*\\ 2 

Tfc+l Cfc+1 

II *ll ^fe ll *n2 ^fe II *I|2 

= \\u k +i - u \\ Mk+1 |Ffc+i - x || WVk+i - V II 

Tk+1 ffe+1 

(42) > (l-2<j> k )\\u k+1 -u*\\ Mk+1 . 
Applying (42) to Lemma 2 yields 

(43) |K - "Ilk > (1 " 20fc)IK+i - 

Now, consider the case that ^-||affc+i — x*\\ 2 + ^\\yk+i ~ y*\\ 2 < C x . We have 

|K+i - u*\\ Mk > |K+i - u*\\ Mk+1 — ||a!fc + i - x*\\ 2 —\\Vk+i - y*\\ 2 

Tfe+l (Tk+1 

(44) > \\u k+1 - u*||M fc+1 - (\>kC x . 
Applying (44) to Lemma 2 yields 

(45) |K - u*f Mk > \\u k +i - u*f Mk+i - <f> k C x . 
From (43) and (45), it follows by induction that 

(46) \\u Q -u*\\ 2 Mk > [] (1- 2&)IK+i -«*llk +1 -E^ 

ielc i 

where I c = {i \ \\x i+1 - x*\\ > C x }. 
We can rearrange (46) to obtain 

H *|,2 , ~ "*Hk + C xY,^i . 
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which shows that |K — u *llM fc remains bounded (note: lim^oo 4>i — 0, and so we may 
assume without loss of generality that {1 — 2<fii} is positive). 

Finally, note that since {r k }, {o~ k }, and |K — u*\\M k are bounded from above, it follows 
from (40) that i \\x k — x*\\ 2 is bounded from above. But ^ \\y k — y*\\ 2 is also bounded from 
above, and so |K — u*\\H k is bounded as well. □ 

Lemma 3 established upper bounds on the sequence of iterates. To complete our conver- 
gence proof we also need lower bounds on |K — u*||| ffc . Note that in the case of indefinite 
Mfe, this quantity may be negative. The following result show that |K — u *I!m does not 
approach negative infinity. 

Lemma 4. Suppose the step sizes for Algorithm 1 satisfy conditions A , B , and C . Then 

\\u k+ i-u*f Mh > C L 

for some lower bound C^. 

Proof. In the case that CI holds, Lemma 1 tells us that M k is positive definite. In this case 
|| • || m is a proper norm and 

IK+i - u*f Mk >0. 

In the case that C2 holds, we have that either X or Y is bounded. Assume without loss 
of generality that Y is bounded. We then have ||y|| < C y for all y E Y. We can then obtain 

1 4C 2 
\\u k+1 - u*\\ Mk > —\\x k +i - x*\\ 2 - 2C y \\A\\ op \\x k+1 - i*|| H y - 

1 4C 2 
(47) > . r A x k+1 - x*\\ 2 - 2C„||i4||o P ||a; fc+ i - + . * . ■ 

mm{r fc } mm{crfe} 

Note that (47) is quadratic in ||a;fe+i — x*\\, and so this quantity is bounded from below. 



□ 

We are now ready to prove the main result. The idea is to show that the norms of 
the residuals have a finite sum and thus converge to zero. Because the "natural" norm of 
the problem, the M^-norm, changes after each iteration, we must be careful to bound the 
differences between the norms used at each iteration. For this purpose, condition B will be 
useful because it guarantees that the various Mfc-norms do not differ too much as k gets 
large. 

Theorem 1. Suppose that the stepsizes in Algorithm 1 satisfy conditions A and B, and 
either CI or C2 . Then the algorithm converges in the residuals, i.e. 

lim ||P fc || 2 + p fe || 2 =0. 

k— > oo 

Proof. Rearranging Lemma (37) gives us 

(48) |K - u *\\ 2 M k ~ IK+i - u *ll!r* > IK+i - ^fcllk- 

Summing (48) for 1 < k < n gives us 

n 

hi - u*f Mo - |K+i - u*f Mn + IK - "Ilk - IK - "Ilk-! 

(49) k t 



\u k+ i - u k \\ Mk 

k=l 
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fe=l 



Now, we expand the summation on the left side of (48) using the definition of M k to 
obtain 

n 

EK-ulk-K-ulk-! 

= £(- - — )IN - z*|| 2 + (- - — )||y fc - y*f 

" / 1 1 

< ^(1 - <5 fc -i) - *1 2 + -||tf fc - yl 



(50) 



fc=i 



= - u*||H fc < Cq }] <f>k-i < °°, 

fc=i fe=i 

where we have used the bound \\u k — u*\\ 2 Hk < Cjj from Lemma 3 . 

Applying (50) to (49), and noting from Lemma 4 that ||u n +i — m*||m„ is bounded from 
below, we see that 

n 

IK+i - u k \\ 2 Mk < oo, 

fe=i 

and it follows that limfe_ i . 00 — u/c||M fc = 0. If condition C2 holds, then this clearly 

implies that 

(51) lim ||ufc+i -Uk\\H k =0. 

k— >oo 

If condition CI holds, then we still have (51) from Lemma 1. Since r k and a k are bounded 
from above, (51) implies that 

(52) lim \\u k+1 -u k \\ =0. 

k— >-oo 

Recall the definition of R k in equation (29). From (31), we know that 

R k = M k (u k - Uk+i) € R(u k +i), 

and so 



(53) lim \\R k \\ = lim \\M(u k+1 - u k )\\ < max{||M fc ||} lim ||u fc+1 -u fc || = 0. 

k—t-oc k—toc k k—too 



a 



7. Numerical Results 

To demonstrate the performance of the new adaptive PDHG schemes, we apply them to 
the test problems described in Section 3. We run the algorithms with parameters a$ = 0.5, 
A = 1.5, and 7/ = 0.95. The backtracking method is run with 7 = .75, /3 = .95, and we 
initialize stepsizes using formula (27). We terminate the algorithms when both the primal 
and dual residual norms (i.e. \P k \ and \D k \) are smaller than 0.05, unless otherwise specified. 

We consider four variants of PDHG. The method "Adapt:Backtrack" denotes the adaptive 
backtracking method (Algorithm 3) with large initial stepsizes chosen according to (27). The 
method "Adapt: rcr = L" refers to the adaptive method without backtracking (Algorithm 
2) with T = 0-0 = 0.95p(A T A)-i. 

We also consider the non-adaptive PDHG with two different stepsize choices. The method 
"Const: r, a = refers to the constant-stepsize method with both stepsize parameters 
equal to \L = p(A T A)~?. The method "Const: r-final" refers to the constant-stepsize 
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Figure 1. (left) Convergence curves for the Rudin-Osher-Fatemi denoising 
experiment with /i = 0.05. The y-axis displays the difference between the 
value of the ROF objective function (f 2) at the fcth iterate and the optimal 
objective value, (right) Stepsizc sequences, {t^}, for both adaptive schemes. 

method, where the stepsizes are chosen to be the final values of the stepsizes used by 
"Adapt: rer = L" . This final method is meant to demonstrate the performance on PDHG 
with a stepsize that is customized to the problem at hand, but still non-adaptive. 

7.1. Experiments. The specifics of each test problem are described below: 

Rudin-Osher-Fatemi Denoising. We apply the denoising model (12) to the "Cameraman" 
test image. The image is scaled to have pixels in the range [0, 255], and contaminated with 
Gaussian noise of standard deviation 10. The image is denoised with fi — 0.25, 0.05, and 

0. 01. 

We display denoised images in Figure 2. We show results of numerical time trails in Table 

1. Note that the iteration count for denoising problems increases for small fx, which results 
in solutions with large piecewise -constant regions. Note also the similar performance of 
Algorithms 2 and 3, indicating that there is no significant advantage to knowing the constant 
L = P {A T A)-\ 

We plot convergence curves and show the evolution of Tfc in Figure 1. Note that is 
large for the first several iterates and then decays over time. This behavior is typical for 
many TV-regularized problems. 

TVL1 Denoising. We again denoise the "Cameraman" test image, this time using the model 
(17), which tends to result in smoother results. The image is denoised with fi = 2, 1, and 
0.5. 

We display denoised images in Figure 2, and show time trials results in Table 1. Much 
like in the ROF case, the iteration counts increase as denoising results get coarser (i.e. when 
H gets small.) There is no significant advantage to specifying the value of L = p(A T A) T , 
because the backtracking algorithm was very effective for this problem, much like in the 
ROF case. 

Convex Segmentation. We apply the model (19) to a test image containing circular regions 
organized in a triangular pattern. By choosing different weights for the data term /x, we 
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ROF TVL1 




fi = 0.01 n = 0.5 



Figure 2. Results of denoising experiments with cameraman image, (left 
column) ROF results with fi — 0.25, 0.05, and 0.01 from top to bottom, 
(right column) TVL1 results with f/, = 2, 1, and 0.5 from top to bottom. 

achieve segmentations at different scales. In this case, we can identify each circular region 
as its own entity, or we can groups regions together into groups of 3 or 9 circles. 

Results of segmentations at different scales are presented in Figure 3. Convergence curves 
are shown in Figure 4 for both the energy and residuals. Note that the residuals exhibit 
a similar shape to the energy convergence curves, indicating that the residuals are a good 
measure of convergence. 
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Figure 3. Segmentation of the "circles" test image at different scales. 




Figure 4. (left) Convergence curves for the segmentation experiment with 
fi = 0.15. (right) Primal and dual residual norms for the backtracking 
adaptive method. 

Compressed Sensing. We reconstruct a Shepp-Logan phantom from sub-sampled Hadamard 
measurements. Data is generated by applying the Hadamard transform to a 256 x 256 
discretization of the Shepp-Logan phantom, and then sub-sampling 5%, 10%, and 20% of 
the coefficients are random. We scaled the image to have pixels in the range [0,255], the 
orthogonal scaling of the Hadamard transform was used, and we reconstruct all images with 
}i=l. The compressive reconstruction with 10% sampling is shown in Figure 5. See Table 
1 for iteration counts at the various sampling rates. 

£oo Minimization. To demonstrate the performance of the adaptive schemes with complex 
inputs, we solve a signal approximation problem using a complex-valued tight frame. Prob- 
lems of the form (22) were generated by randomly sub-sampling 100 rows from a discrete 
Fourier matrix of dimension 512. The input signal z was a vector on random Gaussian vari- 
ables. Problems were solved with approximation accuracy e = 1,0.1, and 0.01, and results 
are reported in Table 1. 

Because this problem is complex- valued, we use the definition of bk given by (28) to 
guarantee that the stepsizes are real-valued. 

General Linear Programming. We test our algorithm on the standard test problem "sc50b" 
that ships with most distributions of Matlab. The problem is to recover 40 unknowns subject 
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Figure 5. Compressed sensing reconstruction experiment, (left) Original 
Shepp-Logan phantom, (right) Image reconstructed from a random 10% 
sample of noisy Hadamard coefficients. Images are depicted in false color 
to accentuate small features. 



to 30 inequality and 20 equality constraints. To accelerate the method, we apply a precon- 
ditioner to the problem inspired by the diagonal stepsize proposed for linear programming 
in [25]. This preconditioner works by replacing A and b in (23) with 

A = T^A^ 1 h = rh 

where T and £ are diagonal preconditioning matrices with Ta = X^l^ijl an< ^ ^ij = 

Time trial results are shown in Table 1. Note that PDHG is not as efficient as conventional 
linear programing methods for small-scale problems; the Matlab "linprog" command took 
approximately 0.05 seconds to solve this problem using interior point methods. Interestingly, 
the backtracking scheme (Algorithm 3) out-performed Algorithm 2 for this problem (see 
Table 1). 

Table 1. Time Trial Results. Iteration counts are reported for each prob- 
lem/algorithm, with total runtime (sec) in parenthesis. 



Problem 


Adapt :Backtrack 


Adapt: t(t = L 


Const: r, a = \fh 


Const: T-final 


ROF, fi = 


.25 


16 (0.0475) 


16 (0.041) 


78 (0.184) 


48 (0.121) 


ROF, fi = 


.05 


50 (0.122)) 


51 (0.122) 


281 (0.669) 


97 (0.228) 


ROF, p = 


.01 


109 (0.262) 


122 (0.288) 


927 (2.17) 


152 (0.369) 


TVL1, fj, 


= 2 


286 (0.957) 


285 (0.954) 


852 (2.84) 


380 (1.27) 


TVL1, y, 


= 1 


523 (1.78) 


521 (1.73) 


1522 (5.066) 


669 (2.21) 


TVL1, n -- 


= .5 


846 (2.74) 


925 (3.07) 


3244 (10.8) 


1363 (4.55) 


Segment, fi 


= 0.5 


42 (0.420) 


41 (0.407) 


114 (1.13) 


53 (0.520) 


Segment, \x 


= .15 


111 (1.13) 


107 (1.07) 


493 (5.03) 


131 (1.34) 


Segment, \x 


= .08 


721 (7.30) 


1016 (10.3) 


706 (7.13) 


882 (8.93) 


Compressive 


(20%) 


163 (4.08) 


168 (4.12) 


501 (12.54) 


246 (6.03) 


Compressive 


(10%) 


244 (5.63) 


274 (6.21) 


908 (20.6) 


437 (9.94) 


Compressive 


(5%) 


382 (9.54) 


438 (10.7) 


1505 (34.2) 


435 (9.95) 


t x (e = 


1) 


86 (0.0675) 


56 (0.0266) 


178 (0.0756) 


38 (0.0185) 


^ (e = 


•1) 


79 (0.0360) 


72 (0.0309) 


195 (0.0812) 


78 (0.0336) 


loo (e = 


01) 


89 (0.0407) 


90 (0.0392) 


200 (0.0833) 


82 (0.0364) 


LP 




18255 (3.98) 


20926 (4.67) 


24346 (5.42) 


29357 (6.56) 
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7.2. Discussion. Several interesting observations can be made from the results in Table 1. 
First, both Algorithm 3 ("Adapt:Backtrack") and 2 ("Adapt: tct = L") have similar per- 
formance on average for the imaging problems, with neither algorithm showing consistently 
better performance than the other. This shows that the backtracking scheme (Algorithm 3) 
is highly effective and that there is no significant advantage to requiring the user to supply 
the constant p(A T A) other than for simplicity of implementation. 

Interestingly, when the backtracking method did show better performance, it was often 
due to the ability to use large stepsizes that violate the stepsize restriction CI . When solving 
the ROF denoising problem with fi = 0.01, for example, the backtracking scheme terminated 
with rfcCTfc = 0.14, while the conventional stepsize restriction is 1/ p(A T A) = 1/8 = 0.125. 

There are two significant observations to be made from Figure 4, which depicts the 
convergence of the residuals for an instance of the segmentation problem. First, the residuals 
have the same overall shape as the energy convergence curves. This indicates that the 
residuals serve as a good proxy for the objective function, and are thus a reasonable measure 
of convergence. Second, the residuals of the initial iterates in Figure 4 differ in scale by 
almost two orders of magnitude. After the first few iterations of the method, the residual 
balancing scheme maintains that the primal and dual residuals in Figure 4 remain roughly 
the same size within a factor of A = 1.5. 

Finally, the method "Const: r-final" (a non-adaptive method using the "optimized" 
stepsizes obtained from the last iterations of the adaptive scheme) did not always outperform 
the non-adaptive scheme using the non-optimized stepsizes. This occurs because the true 
"best" stepsize choice depends on the active set of the problem in addition to the structure 
of the remaining error and thus evolves over time. This situation is depicted in Figure 1, 
which shows the evolution of over time for the adaptive methods. Note that for this 
problem, both methods favor a large value for Tfc during the first few iterations, and this 
value decays over time. Behavior like this is typical for the imaging problems discusses above. 
This illustrates the importance of an adaptive stepsize — by optimizing the stepsizes to the 
current active set and error levels, adaptive methods can achieve superior performance. 

8. Conclusion 

We have introduced new adaptive variants of the powerful Primal-Dual Hybrid Gradi- 
ent (PDHG) schemes. We proved rigorous convergence results for adaptive methods and 
identified the necessary conditions for convergence. We also presented practical implemen- 
tations of adaptive schemes that formally satisfy the convergence requirements. Numerical 
experiments show that adaptive PDHG methods have advantages over non-adaptive imple- 
mentations in terms of both efficiency and simplicity for the user. 

The new backtracking variant of PDHG is advantageous because it does not require the 
user to supply any information about the problem instance being solved such as the spectral 
radius of the matrix A. This is particularly useful when creating "black-box" solvers for large 
classes of problems. For example, a generic linear programming solver should not require the 
user to supply information about the eigenvalues of the constraint matrix. Furthermore, this 
flexibility seems to come at no added cost; the fully adaptive backtracking scheme performs 
at least as well as methods that do require spectral information. 
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